Search results for "Content extraction"
showing 4 items of 4 documents
Readability and the Web
2012
Readability indices measure how easy or difficult it is to read and comprehend a text. In this paper we look at the relation between readability indices and web documents from two different perspectives. On the one hand we analyse how to reliably measure the readability of web documents by applying content extraction techniques and incorporating a bias correction. On the other hand we investigate how web based corpus statistics can be used to measure readability in a novel and language independent way.
Content Code Blurring: A New Approach to Content Extraction
2008
Most HTML documents on the world wide web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel content extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing content extraction solutions we show thatfor most documents content code blurrin…
Combining content extraction heuristics
2008
The main text content of an HTML document on the WWW is typically surrounded by additional contents, such as navigation menus, advertisements, link lists or design elements. Content Extraction (CE) is the task to identify and extract the main content. Ongoing research has spawned several CE heuristics of different quality. However, so far only the Crunch framework combines several heuristics to improve its overall CE performance. Since Crunch, though, many new algorithms have been formulated. The CombinE system is designed to test, evaluate and optimise combinations of CE heuristics. Its aim is to develop CE systems which yield better and more reliable extracts of the main content of a web …
Estimating web site readability using content extraction
2009
Nowadays, information is primarily searched on the WWW. From a user perspective, the readability is an important criterion for measuring the accessibility and thereby the quality of an information. We show that modern content extraction algorithms help to estimate the readability of a web document quite accurate.